Wine quality by Prasad Pagade (1/25/2017)

========================================================

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

The dataset has 1599 observations of 13 variables. Quality ratings cluster around the median of 6 (mean 5.64). For free.sulfur.dioxide (maximum 72 vs. mean 15.87), total.sulfur.dioxide (289 vs. 46.47) and residual.sugar (15.5 vs. 2.54), the maximum lies far above the mean, which points to long right tails.
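To make the structure concrete, the sketch below rebuilds the six rows printed by head() above as a small data frame and inspects it. The name `wine_head` is an assumption used only for illustration; the report's own loading step is not shown, so this is not the author's code.

```r
# First six rows, copied from the head() output above.
# `wine_head` is a hypothetical name, not the report's full dataset.
wine_head <- data.frame(
  X                    = 1:6,
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity     = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid          = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density              = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L)
)

str(wine_head)              # column types, as in the str() output above
summary(wine_head$alcohol)  # five-number summary plus mean for one column
table(wine_head$quality)    # how many wines received each rating
```

Running the same three calls on the full data frame should reproduce the kind of output listed above.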

Univariate Plots Section

Let us take a first look at the individual variables by plotting them below.

1> Fixed acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity has a long-tailed distribution. The log10 transform does not reveal anything new, but it makes the distribution look closer to normal.
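A quick way to see why the log10 transform pulls in the long tail is to compare how far the maximum and the minimum sit from the median, before and after the transform. The sketch below uses only the summary values printed above; the variable names are assumptions for illustration.

```r
# Summary values for fixed acidity, taken from the output above.
fa_min    <- 4.60
fa_median <- 7.90
fa_max    <- 15.90

# Ratio of the upper tail length to the lower tail length on the raw scale.
ratio_raw <- (fa_max - fa_median) / (fa_median - fa_min)

# The same ratio after a log10 transform.
ratio_log <- (log10(fa_max) - log10(fa_median)) /
             (log10(fa_median) - log10(fa_min))

print(ratio_raw)  # about 2.4: the upper tail is much longer
print(ratio_log)  # about 1.3: the tails are closer to symmetric
```

A ratio near 1 would mean the two tails are about equally long, so the drop from roughly 2.4 to roughly 1.3 is what "normalizes the distribution" amounts to here.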

2> Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Like fixed acidity, volatile acidity has a long-tailed distribution. However, in its log10 plots the distribution looks slightly bimodal.

2.1> Total acidity (fixed acidity + volatile acidity)

3> Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

4> Residual Sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Residual sugar has a very long-tailed distribution with many outliers. Some of these outliers are more than 9 standard deviations away from the median! It will be interesting to see how these outliers affect the quality of wine. In the log10 plots the values are still very skewed, but the distribution looks closer to normal.

In the third plot, I removed the top five percent of data points to get a better view of the core of the data.
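Trimming the top few percent can be sketched with a quantile cutoff. The helper `trim_top` and the toy values below are hypothetical (the values are made up, not taken from the wine data); the report does not show how the trimming was actually done.

```r
# Hypothetical helper: keep only values at or below the given quantile.
trim_top <- function(x, prob = 0.95) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Made-up, right-skewed toy values (NOT the wine data).
sugar <- c(1.2, 1.6, 1.8, 1.9, 1.9, 2.0, 2.2, 2.3, 2.6, 2.6,
           2.8, 3.0, 3.2, 3.5, 4.0, 4.5, 5.5, 6.1, 9.0, 15.5)

core <- trim_top(sugar, prob = 0.95)
print(length(core))  # 19: only the single most extreme value is dropped
print(range(core))
```

The same helper with `prob = 0.98` would correspond to removing the top two percent.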

5> Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides have a distribution similar to that of residual sugar, with a strong concentration around the median. The box plot also shows many outliers. In the second plot, the top two percent of data points were removed to show the distribution of points around the median more clearly.

6> Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Interestingly, free sulfur dioxide has a bimodal distribution under the log10 transform. The data is also more spread out than in the other features we have seen so far.

7> Total sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Total sulfur dioxide is similar in some ways to free sulfur dioxide. I would argue that its points are not quite as dispersed, as there are fewer outliers and its interquartile range does not look quite as large. It also has a long-tailed distribution, but in its log10 plot the points look roughly normally distributed.

7.1> Bound sulfur dioxide

I created a new variable, bound sulfur dioxide (total sulfur dioxide - free sulfur dioxide), to see if it shows any interesting pattern. Let’s compare all three.

8> Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density has a very normal looking distribution with most of the values falling between 0.995 and 1. For comparison, water has a density of 1, so most of our wine is less dense than water. There are very few outliers.

9> pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Another normal-looking distribution, with most of the pH values falling between 3.1 and 3.5. As with density, there are very few outliers.

10> Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates has a longer tail than density and pH, but it still looks roughly normally distributed, as most of the values are clustered around 0.6. Some of its outliers lie very far from the median, and it will be interesting to see how that affects the quality of wine. In its log10 plots, sulphates looks much more normally distributed, although some outliers remain despite the transformation.

11> Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Alcohol has a long-tailed distribution with only a few outliers. The log10 plots reveal little that is new: the distribution is still long-tailed and looks much like the original. About three quarters of the wines are at or below 11.1% alcohol (the third quartile), which matches my own experience, as I have rarely picked up a wine with more than 11% alcohol.

12> Quality

Quality is scored on a 0-10 scale, but the wines in this dataset are rated only from 3 to 8, with a median of 6, so most of the wines in this analysis are average wines. It will be interesting to try to find what can make a wine very good or very bad, and to see if there is much correlation between the variables.

Univariate Analysis

What is the structure of your dataset?

The dataset is tidy, with 1599 observations of 13 variables. All of the variables are numeric. The first one, X, is an index. The “quality” variable takes only 6 discrete values: 3, 4, 5, 6, 7, 8.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature of interest in the dataset. It would be interesting to see which features contribute most to the quality of the wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect alcohol, pH, residual sugar, and total acidity to contribute most to the quality of the wine. After a little research on red wine, it seems that people enjoy a red wine that is neither tart, nor sweet, nor dry, but smooth and wet. It would be interesting to see the composition of different features for the good quality (7 or 8) wines in our dataset.

Did you create any new variables from existing variables in the dataset?

I created three new variables:

  • bound.sulfur.dioxide: the result of subtracting free.sulfur.dioxide from total.sulfur.dioxide.
  • total.acidity: the sum of fixed.acidity and volatile.acidity.
  • class: groups wines into three classes -> bad (qualities 3 and 4), regular (qualities 5 and 6) and good (qualities 7 and 8).
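The three derived variables can be sketched as follows. The data frame `wine` holds only the first six rows printed earlier and is an assumption for illustration, as is the use of cut() for the class variable; the report does not show its own code.

```r
# Six rows copied from the head() output above; `wine` is a hypothetical name.
wine <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity     = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Bound sulfur dioxide: total minus free.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

# Total acidity: fixed plus volatile.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

# Quality class: (-Inf, 4] -> bad, (4, 6] -> regular, (6, Inf) -> good.
wine$class <- cut(wine$quality,
                  breaks = c(-Inf, 4, 6, Inf),
                  labels = c("bad", "regular", "good"))

print(wine[, c("bound.sulfur.dioxide", "total.acidity", "class")])
```

Because cut() uses right-closed intervals by default, qualities 3 and 4 fall in "bad", 5 and 6 in "regular", and 7 and 8 in "good", matching the grouping described above.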

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I noted that most of the wines are rated either 5 or 6. This could make it more difficult to determine what makes a good wine, as there is less data about good wines. More data about lesser wines would also have been useful to provide a better contrast between bad and good wines.

Most of the observations have an alcohol value between 9 and 12, with a median of 10.2. It is strange that wines with a quality of 5 tend to have less alcohol.

Regarding the new variables, bound sulfur dioxide (“bound.sulfur.dioxide”) tends to have larger values than free sulfur dioxide. The proportion of free sulfur dioxide (“pfree.sulfur.dioxide”) has an almost normal distribution, with a mean around 0.4.

For some of the features, I removed the top few percent of data points in an additional plot. This gave a better view of the core of the data, i.e. the interquartile range and how it is distributed.

As mentioned above, I grouped the “quality” feature into bad, regular and good classes to make the later plots easier to read.

Bivariate Plots Section

1> Let’s plot and compare the distribution of different features with quality. We will use the new variable class to draw the density plots below.

It seems that bad wines have higher volatile acidity and do not have high citric acid values. They also tend to have lower sulphate values. Good wines tend to have more alcohol.

2> Now let’s build a plot to see the correlation of the features with one another

Interesting! Most of our initial guesses about which factors are related to wine quality hold up. We can see that alcohol and sulphates are positively correlated with quality and that volatile acidity is negatively correlated with it. My assumption that sugar would be an important factor for wine quality seems to be incorrect based on what the plot shows.

3> Let’s explore these relations in more depth

Let’s plot alcohol and quality.
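The mechanics of ranking features by their correlation with quality can be sketched with cor(). The helper `cor_with_target` and the data frame `wine6` are hypothetical, and the six rows (copied from the head() output above) are far too few to reproduce the correlations discussed in this report; the sketch only shows how such a ranking is computed.

```r
# Hypothetical helper: Pearson correlation of every other column with `target`,
# sorted from most positive to most negative.
cor_with_target <- function(df, target) {
  predictors <- setdiff(names(df), target)
  r <- cor(df[predictors], df[[target]])
  sort(r[, 1], decreasing = TRUE)
}

# Six rows from the head() output above (NOT the full dataset).
wine6 <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

r <- cor_with_target(wine6, "quality")
print(round(r, 2))
```

On the full data frame the same call would cover all features at once; the signs in this tiny sample are driven by a single wine rated 6 and should not be read as results.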

3.1> Alcohol vs Class

As observed earlier, good quality wines have higher levels of alcohol.

3.2> Sulphates vs Class

Again, as observed earlier, higher quality wines have higher levels of sulphates.

3.3> Volatile acidity vs Class

This plot shows that high quality wines have lower volatile acidity.

3.4> Residual sugar vs Class

I explored the impact of residual sugar on quality because my gut feeling was that it might affect the taste, and hence the quality, of the wine. But the plots show no visible relationship between residual sugar and quality.

3.5> Citric acid vs Class

Good quality wines have higher levels of citric acid.

3.6> Further dive into Alcohol vs quality

There is a jump in the alcohol variable between qualities 5 and 6. This may mark a separation between potentially bad wines and potentially good wines.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Building on the previous analysis, we checked several correlations. We were able to see a trend across quality for “volatile.acidity”, “citric.acid”, “sulphates” and “alcohol”. All of these except “volatile.acidity” are positively correlated with quality. This is expected, because “volatile.acidity” measures the concentration of acetic acid in wine, which at too high a concentration can lead to a sharp vinegar taste. For wines with a “quality” of 5 the values of “alcohol” are very spread out, although the tendency is that good wines (quality 7 or 8) have the highest median level of alcohol.

Furthermore, the correlation matrices gave us a global overview of all pairwise relations, both numerically and graphically.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I was surprised to see that pH and volatile acidity are positively correlated, since a higher pH value means less acidity, but a higher volatile acidity means more acidity.

As expected, citric acid, fixed acidity, and pH are all rather correlated, given that they all relate to acidity.

Lastly, I was wrong in assuming that residual sugar may have a significant impact on the quality of the wine. In fact, it hardly contributes to quality.

What was the strongest relationship you found?

The strongest relationship, ignoring that between “total.sulfur.dioxide” and “bound.sulfur.dioxide”, is the negative correlation (-0.68) between “fixed.acidity” and “pH”. Of course the correlation is negative for “pH” because a low pH indicates a very acidic environment.

Multivariate Plots Section

For this part, we will focus on our 4 main features we explored in the earlier plots and come up with a predictor model for quality.

For prediction purposes, we have two main problems: 1) Unbalanced spread of quality feature (too many regular wines) 2) The regular wines are very spread across feature values, so they are mixed with bad and good classes. Maybe what we should try is to predict good (or bad) wines, not to try to classify into the three classes.

Let’s compare only bad wines against good wines. In this case, we also add 2D density maps to show where the clusters or groups are located for each combination of features:

Selecting only the “good” and “bad” wines helps us focus on the trends more specifically. Good wines have medium values of citric acid and low values of volatile acidity. Bad wines, on the other hand, have medium-high volatile acidity and low citric acid. This is similar for combinations of “volatile.acidity” with “sulphates” or “alcohol”: good wines are upper left and bad wines are lower right. This tendency is similar in “alcohol” vs “citric.acid” or “sulphates”, although in this case good wines are on the upper right and bad wines on the lower left. For the combination “citric.acid” vs “sulphates”, we can see a more or less horizontal line separating good and bad wines:

Finally, I will build simple linear models using our four main features (alcohol, volatile.acidity, sulphates and citric.acid).

## 
## Calls:
## m1: lm(formula = I(quality ~ alcohol), data = red_wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = red_wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = red_wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid, data = red_wine)
## 
## ================================================================
##                        m1         m2         m3         m4      
## ----------------------------------------------------------------
##   (Intercept)        1.875***   3.095***   2.611***   2.646***  
##                     (0.175)    (0.184)    (0.196)    (0.201)    
##   alcohol            0.361***   0.314***   0.309***   0.309***  
##                     (0.017)    (0.016)    (0.016)    (0.016)    
##   volatile.acidity             -1.384***  -1.221***  -1.265***  
##                                (0.095)    (0.097)    (0.113)    
##   sulphates                                0.679***   0.696***  
##                                           (0.101)    (0.103)    
##   citric.acid                                        -0.079     
##                                                      (0.104)    
## ----------------------------------------------------------------
##   R-squared              0.2        0.3        0.3        0.3   
##   adj. R-squared         0.2        0.3        0.3        0.3   
##   sigma                  0.7        0.7        0.7        0.7   
##   F                    468.3      370.4      268.9      201.8   
##   p                      0.0        0.0        0.0        0.0   
##   Log-likelihood     -1721.1    -1621.8    -1599.4    -1599.1   
##   Deviance             805.9      711.8      692.1      691.9   
##   AIC                 3448.1     3251.6     3208.8     3210.2   
##   BIC                 3464.2     3273.1     3235.7     3242.4   
##   N                   1599       1599       1599       1599     
## ================================================================

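We can see that adding “sulphates” gives a small improvement (AIC falls from 3251.6 to 3208.8), but “citric.acid” does not improve the model: its coefficient (-0.079, standard error 0.104) is not significant and the AIC rises slightly to 3210.2, which agrees with what we saw in the plots. The model is not a very good one, as the R2 value is low (0.3 for model 3). Let’s check the accuracy of the model: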

## [1] "Successful prediction by quality (0-10)"
## [1] 0.5822389
## [1] "Successful prediction by class"
## [1] 0.833646
## [1] "Let's see an example to predict the wine quality(alcohol= 11, volatile.acidity = 0.6 , sulphates= 0.7)"
## [1] "The predicted wine quality is "
## [1] 6

If we use rounded predicted quality values then we predict correctly 58% of the qualities. But if we use quality classes (bad, regular and good), then we increase the success rate to 83%.
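The example prediction above can be checked by hand from the model 3 coefficients in the table. The sketch below is an illustration under stated assumptions: the function names `predict_quality`, `to_class` and `success_rate` are hypothetical, the coefficients are the rounded values printed in the table (so the raw prediction may differ slightly from the fitted model's), and the report's own prediction code is not shown.

```r
# Coefficients of model m3, as printed (rounded) in the table above.
m3_coef <- c("(Intercept)"    = 2.611,
             alcohol          = 0.309,
             volatile.acidity = -1.221,
             sulphates        = 0.679)

# Hypothetical helper: linear prediction from the rounded coefficients.
predict_quality <- function(alcohol, volatile.acidity, sulphates) {
  unname(m3_coef["(Intercept)"] +
         m3_coef["alcohol"] * alcohol +
         m3_coef["volatile.acidity"] * volatile.acidity +
         m3_coef["sulphates"] * sulphates)
}

# Hypothetical helper: map a quality score to the three classes used in this report.
to_class <- function(quality) {
  cut(quality, breaks = c(-Inf, 4, 6, Inf), labels = c("bad", "regular", "good"))
}

# Hypothetical helper: share of predictions that match the actual values.
success_rate <- function(predicted, actual) mean(predicted == actual)

raw     <- predict_quality(alcohol = 11, volatile.acidity = 0.6, sulphates = 0.7)
rounded <- round(raw)

print(raw)                              # about 5.75
print(rounded)                          # 6, matching the example above
print(as.character(to_class(rounded)))  # "regular"
```

Applying `success_rate` once to rounded predictions against “quality” and once to `to_class` of both would give the two kinds of success rate reported above.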

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

We compared our four features (volatile.acidity, citric.acid, sulphates and alcohol) in pair plots, taking into account the different classes of wine. The regular wines had a large spread; there is no clear boundary between a bad and a regular wine, or between a regular and a good wine. On the other hand, bad wines and good wines are more distinguishable from one another, as we saw in the plot.

We noted that most of the good wines have medium values of citric acid and low values of volatile acidity. Bad wines usually have medium-high volatile acidity and low citric acid. This is similar for combinations of “volatile.acidity” with “sulphates” or “alcohol”: good wines are upper left and bad wines are lower right. This tendency is similar in “alcohol” vs “citric.acid” or “sulphates”, although in this case good wines are on the upper right and bad wines on the lower left. For the combination “citric.acid” vs “sulphates”, we can see a more or less horizontal line separating good and bad wines.

Were there any interesting or surprising interactions between features?

Yes. Based on the bivariate plots, there seems to be a positive correlation between “citric.acid” and “quality”. But in the scatter plots by class of wine (only good and bad), we do not see a clear cutoff in the “citric.acid” feature that distinguishes good wines from bad ones.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created simple linear models using our four main features. The first model includes only “alcohol” as a predictor. The next model adds “volatile.acidity”. Model 3 also adds “sulphates”, and the last model adds “citric.acid”. The R2 values of our models are not very good, and the success rates could be a little misleading. One of the main problems is that we have a very unbalanced dataset (too many “regular” wines). Perhaps the biggest difficulty for the model is distinguishing between bad and regular wines, and between good and regular wines.


Final Plots and Summary

Plot One

Description One

This plot shows the densities for the distributions of all features in the dataset. We created three quality classes for wine, bad (quality 4 or lower) in red, regular (5 or 6) in green and good (7 or higher) in blue, and grouped the features by class. Variables with less overlap between their density curves could help us distinguish between quality classes. Four of the best features for this purpose are volatile acidity, citric acid, sulphates and alcohol. Other variables could also help us detect a specific class, such as fixed acidity (good wines) and % free sulfur dioxide (regular wines).

Note: text, values and ticks of Y-axis were removed for clarity

Plot Two

Description Two

Alcohol by volume and volatile acidity were the two chemical properties most closely related to quality in red wine. Alcohol had a positive relationship with quality, perhaps due to a higher concentration of flavor in wines with higher alcohol percentages. Volatile acidity had a negative relationship with quality rating, due to the fact that higher concentrations can lead to undesirable vinegar-like flavors. As evidenced by the two distinct regions in the plot, the lowest quality wines tended to have lower alcohol percentages and higher volatile acidity concentrations, while the higher quality wines had higher alcohol percentages and lower volatile acidity concentrations, in general.

Plot Three

Description Three

In this plot we show the pairwise comparison for the six combinations of the four main features. Each combination is represented in a scatter plot. We used a subset of the wine dataset, selecting only wines with quality class bad or good. We also deleted some outliers (volatile acidity >= 1.5, citric acid >= 1 and sulphates >= 2). The idea is to show that these features could help to distinguish good wines from bad wines. We omit regular wines because their features are so spread out that it is not easy to make a distinction; besides, a person is usually not interested in detecting a regular wine, but rather in finding a potentially good wine or avoiding a bad one.

These scatter plots also show 2D density maps for each class. This allows us to see regions or clusters of good wine and bad wine.


Reflection

We analysed a red wine dataset with 1599 observations and 13 variables (an index, 11 chemical features and a quality rating). The objective was to analyse the other features to understand their influence on wine quality. After studying the distributions of the features across the quality ratings, we identified four features as the most influential: volatile acidity, citric acid, sulphates and alcohol. After grouping the qualities into three classes (bad, regular and good), we saw that quality was correlated with these main features. The correlation is positive in all cases except for volatile acidity, where it is negative. Multivariate analysis showed that combinations of the main features could help to identify different “spatial” regions for good wines and bad wines. We decided that predicting regular wines does not make much sense: most people want to detect a potentially good wine (or avoid a bad wine).

According to our study, good wines seem to have lower volatile acidity, higher alcohol and medium-high sulphate values. Bad wines tend to have low values of citric acid, although, as we have seen, this feature does not improve our predictive models.

For the predictive model, we tried a simple linear model with only one main feature and then added the other 3 main features one by one. Although the R2 is small, the success rates are fairly high. But this is mainly because the data is unbalanced: there are too many “regular” class observations.

In future work, we should try to improve our modelling procedure by balancing the data. We could also try an algorithm for feature selection. Other machine learning algorithms could work better for this problem.
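One way to see how class imbalance inflates a success rate is to compare it with a majority-class baseline, i.e. the accuracy of always predicting the most common class. The helper `baseline_accuracy` and the toy labels below are hypothetical and made up for illustration; they are not the counts from the wine data.

```r
# Hypothetical helper: accuracy obtained by always predicting the most common label.
baseline_accuracy <- function(labels) {
  max(table(labels)) / length(labels)
}

# Made-up toy labels (NOT the wine data): 8 of 10 wines are "regular".
toy_class <- c("bad", rep("regular", 8), "good")

print(baseline_accuracy(toy_class))  # 0.8
```

If the baseline computed on the real class variable is close to the class-level success rate of the model, the model adds little beyond guessing "regular" every time.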